New Insights from an Analysis of Social Influence 
Networks under the Linear Threshold Model 



Srinivasan Venkatramanan 

Department of Electrical Communication 
Engineering 
Indian Institute of Science 
Bangalore 
vsrini@ece.iisc.ernet.in 



Anurag Kumar 

Department of Electrical Communication 
Engineering 
Indian Institute of Science 
Bangalore 
anurag@ece.iisc.ernet.in 






ABSTRACT 

We study the spread of influence in a social network based 
on the Linear Threshold model. We derive an analytical 
expression for evaluating the expected size of the eventual 
influenced set for a given initial set, using the probability of 
activation for each node in the social network. We then pro- 
vide an equivalent interpretation for the influence spread, in 
terms of acyclic path probabilities in the Markov chain ob- 
tained by reversing the edges in the social network influence 
graph. We use some properties of such acyclic path proba- 
bilities to provide an alternate proof for the submodularity 
of the influence function. We illustrate the usefulness of the 
analytical expression in estimating the most influential set, 
in special cases such as the UILT (Uniform Influence Linear
Threshold), USLT (Uniform Susceptance Linear Threshold)
and node-degree based influence models. We show that the
PageRank heuristic is either provably optimal or performs 
very well in the above models, and explore its limitations in 
more general cases. Finally, based on the insights obtained 
from the analytical expressions, we provide an efficient al- 
gorithm which approximates the greedy algorithm for the 
influence maximization problem. 

Categories and Subject Descriptors 

F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems

Keywords

1. INTRODUCTION

A social network models a set of entities (such as individuals or organizations) that are tied by one or more types of interdependency (such as friendship, collaboration or co-authorship). Typically each individual is a node in the social network, and there is an edge between two nodes if there exists some form of interaction between them. Real world social networks, such as scientific collaboration networks, have been observed [1] to exhibit several properties of complex networks, such as scale-free degree distribution and the small-world phenomenon. Given a social network, there are several well established node-selection heuristics, such as degree centrality and distance centrality, whose effectiveness has been analysed in [2]. In this paper we analyze and derive new insights on the spread of influence under the Linear Threshold model studied by Kempe et al. [8].

Related Literature: Social networks play a fundamental
role as a medium for the spread of information, ideas and
influence among their members. Network diffusion processes
have been investigated extensively in the past, with focus 
on spread of epidemics, diffusion of innovation and decision 
models. The concept of using threshold models to explain 
collective behaviour was first put forward by Granovetter 
in [4], where he discusses the spread of binary decisions, 
among a group of rational agents, for instance in voting 
models. Similar behaviours can also be observed in cases of 
innovation adoption, rumour and disease spreading. New- 
man [5] studied the spread of disease on networks under the 
susceptible-infected-removed (SIR) model and showed how 
concepts from percolation theory can be used to study these 
models on a wide variety of networks. 

Domingos and Richardson [6, 7] were the first to study information
diffusion under the viral marketing perspective, and
they proposed the concept of a customer's network value, 
apart from his intrinsic value. They were also the first to 
pose the combinatorial optimization problem of choosing the 
initial set of customers to maximize the net profits, and 
showed that choosing the right set of users for the market- 
ing campaign could make a large difference. Kempe et al. [8] 
studied the problem of choosing the most influential initial 
set using two different models of information propagation, 
namely the Linear Threshold model (LT model) and the In- 
dependent Cascade model (IC model), and showed that the 
problem is NP-hard and the objective function is submodu- 
lar. They proposed a greedy approximation algorithm that 
was shown to achieve an approximation factor of $(1 - 1/e)$.
They also provided generalizations of the two models, and 
showed how the two generalized models can be made equiv- 
alent. 



Web page ranking algorithms such as Google's PageRank [11] can also be extended as a heuristic to the social network context, for ranking nodes in order of influence. Kimura et al. build upon the Independent Cascade model introduced in [8] and suggest two special cases of the IC model,
which are computationally more efficient, and are good ap- 
proximations to the IC model when the propagation prob- 
abilities are small. Kimura et al. [TJ have also used the 
concept of bond percolation, to easily evaluate the expected 
influence of a given set of nodes, and hence proposed a faster 
version of the greedy algorithm. In [TS] the authors propose 
a general framework for cost effective outbreak detection, of 
which the influence maximization problem is a special case, 
and, by exploiting the submodularity of the influence func- 
tion, propose the CELF algorithm which achieves close to 
greedy algorithm performance. Wei Chen et al. [T7] study 
the IC model and propose an improved version of the greedy 
algorithm and also the degree discount heuristic which are 
found to perform on par with the greedy algorithm. 

Our Contributions: We build upon the Linear Threshold model studied by Kempe et al. [8]. Our major contributions are as follows:

• We derive recursive expressions for the expected influence of a given initial set (in Section 3), provide an interpretation via Acyclic Path Probabilities in Markov chains, and provide an alternate proof of submodularity of the objective function (in Sections 4 and 5).

• We provide some sample cases where the PageRank algorithm is provably optimal or performs very well (in Section 6), and subsequently we discuss the limitations of PageRank in more general cases.

• We also propose the G1-Sieving algorithm to find the most influential set, based on the insights derived from the recursive expression (in Section 7), and find that G1-Sieving performs almost on par with the Greedy algorithm and is also very efficient in terms of computation.



2. THE SOCIAL NETWORK MODEL 
Glossary of Notation 

$\mathcal{N}$ - weighted directed graph of the entire social network
$\mathcal{N} \setminus \mathcal{A}$ - graph obtained by removing the nodes in $\mathcal{A} \subset \mathcal{N}$ and all links to or from these nodes

$W$ - influence matrix with $w_{i,j}$ as entries, gives the edge weights of $\mathcal{N}$

$\Theta_j$ - $U[0,1]$ random threshold chosen by node $j$

$b_j(\mathcal{A}) = \sum_{i \in \mathcal{A}} w_{i,j}$ - total influence into node $j$ from set $\mathcal{A}$
$\mathcal{A}_0$ - initial active set
$\mathcal{A}_k$ - set of all active nodes at time step $k$, $\mathcal{A}_0 \subseteq \mathcal{A}_1 \subseteq \mathcal{A}_2 \dots$
$\mathcal{D}_k$ - set of nodes which were activated at time step $k$, $\mathcal{D}_k = \mathcal{A}_k \setminus \mathcal{A}_{k-1}$

$S$ - random time at which the activation process stops, $S = \min\{k : \mathcal{A}_k = \mathcal{A}_{k-1}\}$

$g_j^{(\mathcal{N},\mathcal{A})}(k) = P^{(\mathcal{N},\mathcal{A})}(j \in \mathcal{D}_k) = P^{(\mathcal{N})}(j \in \mathcal{D}_k \mid \mathcal{A}_0 = \mathcal{A})$

$g_j^{(\mathcal{N},\mathcal{A})} = P^{(\mathcal{N},\mathcal{A})}(j \in \mathcal{A}_S)$

Social Network Description. In this work, we adopt a model in which a social network is a weighted directed graph $\mathcal{N} = (V, E)$, where the edge weights $w_{i,j}$ give a measure of the influence of node $i$ on node $j$. The activation process begins with an initial set of active nodes $\mathcal{A}_0$, and at each step $k$ the set of active nodes $\mathcal{A}_k$ keeps increasing, due to the influence of the already active nodes. This goes on until a terminal set $\mathcal{A}_S$ is reached, from where the activation process cannot proceed further. We shall focus only on the progressive case, where nodes, once activated, never switch back to the inactive state.



Activation Models. There are two widely used activation models, namely, the Linear Threshold model and the Independent Cascade model. In the Linear Threshold model, we ensure that $\sum_{i \neq j} w_{i,j} \le 1$ for every node $j$. In this model, each node $j$ randomly chooses a threshold $\Theta_j$ uniformly from $[0,1]$ at the beginning. At step $k$, a node $j$ gets activated if it had been inactive until step $k-1$ and

$$\sum_{i \in \mathcal{A}_{k-1}} w_{i,j} \ge \Theta_j$$

In the Independent Cascade model, we start with an initial active set $\mathcal{A}_0$ and the activation proceeds according to the following randomized rule. Whenever a node $i$ becomes active at step $k$, it is given one attempt at activating each of its inactive neighbours $j$, succeeding with probability $w_{i,j}$. If $i$ succeeds, $j$ becomes part of $\mathcal{A}_{k+1}$, but whether or not $i$ succeeds, it cannot make any more attempts at activating its neighbours in subsequent rounds. Again, the activation process continues until no more activations are possible. Kempe et al. also provide generalizations of the above two models in [8], and show how the two generalized models can be made equivalent. In the remaining sections of this paper, we shall discuss only the Linear Threshold model.



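Problem Statement. Given the initial set $\mathcal{A}_0$, the activation process evolves in discrete time steps according to the Linear Threshold model. Let $\mathcal{A}_k$ denote the set of all active nodes at time $k$. Since we are dealing with the progressive case, it is clear that $\mathcal{A}_0 \subseteq \mathcal{A}_1 \subseteq \dots \subseteq \mathcal{N}$. Let $\mathcal{D}_k$ denote the set of nodes which were activated at time $k$, i.e., $\mathcal{D}_k = \mathcal{A}_k \setminus \mathcal{A}_{k-1}$ and $\mathcal{D}_0 = \mathcal{A}_0$. Let $S$ denote the random stopping time at which the activation process stops, i.e., $S = \min\{k : \mathcal{A}_k = \mathcal{A}_{k-1}\}$. Then we can define $\sigma^{(\mathcal{N},\mathcal{A}_0)} = E^{(\mathcal{N},\mathcal{A}_0)}[|\mathcal{A}_S|]$ to be the expected size of the terminal set $\mathcal{A}_S$, starting with $\mathcal{A}_0$ as the initial set in the network $\mathcal{N}$. The influence maximization problem can then be formulated as follows:

$$\max \; \sigma^{(\mathcal{N},\mathcal{A}_0)} \qquad (1)$$

$$\text{s.t. } \mathcal{A}_0 \subseteq \mathcal{N}, \quad |\mathcal{A}_0| = K$$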



Greedy Algorithm. The greedy hill climbing solution for the influence maximization problem is shown in Algorithm 1.

In the algorithm, $K$ is the size of the required initial set, and the set $X$ obtained after the $K$ iterations is the greedy solution. It is noted in [8] that this achieves an approximation factor of $(1 - 1/e)$, and the proof involves the submodularity and monotonicity of $\sigma^{(\mathcal{N},\mathcal{A})}$.

X ← ∅;
for i = 1 to K do
    Choose v_i such that v_i = argmax_v σ^(N, X ∪ {v});
    X ← X ∪ {v_i};
end

Algorithm 1: Greedy Algorithm
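
The greedy procedure can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the dict-of-dicts representation `weights[i][j]` for $w_{i,j}$ and the function names `simulate_lt`, `estimate_sigma` and `greedy_seed_set` are assumptions made here, and $\sigma$ is replaced by a Monte Carlo average over simulated cascades.

```python
import random


def simulate_lt(weights, seeds, rng):
    """Run one progressive LT cascade and return the terminal active set.

    weights[i][j] holds w_{i,j}, the influence of node i on node j
    (hypothetical representation chosen for this sketch).
    """
    nodes = set(weights)
    for out_edges in weights.values():
        nodes.update(out_edges)
    # Thresholds drawn uniformly from (0, 1], so zero influence never activates.
    theta = {j: 1.0 - rng.random() for j in nodes}
    active = set(seeds)
    while True:
        newly_active = set()
        for j in nodes - active:
            incoming = sum(weights.get(i, {}).get(j, 0.0) for i in active)
            if incoming >= theta[j]:
                newly_active.add(j)
        if not newly_active:
            return active
        active |= newly_active


def estimate_sigma(weights, seeds, runs, rng):
    """Monte Carlo estimate of the expected terminal set size."""
    total = sum(len(simulate_lt(weights, seeds, rng)) for _ in range(runs))
    return total / runs


def greedy_seed_set(weights, K, runs=200, seed=0):
    """Algorithm 1 with sigma replaced by its Monte Carlo estimate."""
    rng = random.Random(seed)
    nodes = set(weights)
    for out_edges in weights.values():
        nodes.update(out_edges)
    chosen = []
    for _ in range(K):
        best_node, best_value = None, float("-inf")
        # Sorted order makes tie-breaking deterministic.
        for v in sorted(nodes - set(chosen)):
            value = estimate_sigma(weights, chosen + [v], runs, rng)
            if value > best_value:
                best_node, best_value = v, value
        chosen.append(best_node)
    return chosen
```

Each greedy step re-estimates the influence of every candidate set by simulation, which is the cost that the analytical expressions of the following sections aim to avoid.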

3. RECURSIVE EXPRESSION FOR $\sigma^{(\mathcal{N},\mathcal{A}_0)}$

As far as we know, there is no work on mathematically characterising the value of $\sigma^{(\mathcal{N},\mathcal{A}_0)}$ for the models introduced in [8]. Instead, $\sigma^{(\mathcal{N},\mathcal{A}_0)}$ is generally obtained by simulating the activation process several times on the social network and taking the average value. In this section, we derive an expression for $\sigma^{(\mathcal{N},i)}$ in recursive form, and hence give a general expression for $\sigma^{(\mathcal{N},\mathcal{A}_0)}$. We use this expression later to provide insights into various existing heuristics, and also for proposing an efficient algorithm that closely approximates the greedy solution. Let us begin with the definition of $\sigma^{(\mathcal{N},\mathcal{A}_0)}$.

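$$\sigma^{(\mathcal{N},\mathcal{A}_0)} = E^{(\mathcal{N},\mathcal{A}_0)}[|\mathcal{A}_S|]$$

Note that, since the $\mathcal{D}_k$'s are disjoint, and $\bigcup_{k=0}^{|\mathcal{N}|} \mathcal{D}_k = \mathcal{A}_S$, we can write,

$$\sigma^{(\mathcal{N},\mathcal{A}_0)} = E^{(\mathcal{N},\mathcal{A}_0)}\Big[\sum_{k=0}^{|\mathcal{N}|} \sum_{j \in \mathcal{N}} I_{\{j \in \mathcal{D}_k\}}\Big]$$

$$= \sum_{k=0}^{|\mathcal{N}|} \sum_{j \in \mathcal{N}} E^{(\mathcal{N},\mathcal{A}_0)}\big[I_{\{j \in \mathcal{D}_k\}}\big]$$

$$= \sum_{k=0}^{|\mathcal{N}|} \sum_{j \in \mathcal{N}} g_j^{(\mathcal{N},\mathcal{A}_0)}(k)$$

In the above expressions, $I_{\{E\}}$ denotes the indicator variable for the event $E$, and we also use the fact that the total number of time steps of the activation process is bounded above by the number of nodes in the network, $|\mathcal{N}|$. $g_j^{(\mathcal{N},\mathcal{A}_0)}(k)$ gives the probability that node $j$ is activated at time step $k$, given that we start with $\mathcal{A}_0$ as the initial set in the network $\mathcal{N}$. We state the following lemma, which will help us determine $g_j^{(\mathcal{N},\mathcal{A}_0)}(k)$.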



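Lemma 3.1.

1. For $j \in \mathcal{A}_0$,

(a) $g_j^{(\mathcal{N},\mathcal{A}_0)}(0) = 1$

(b) $g_j^{(\mathcal{N},\mathcal{A}_0)}(k) = 0$, for all $k > 0$

2. For $j \notin \mathcal{A}_0$,

(a) $g_j^{(\mathcal{N},\mathcal{A}_0)}(0) = 0$

(b) $g_j^{(\mathcal{N},\mathcal{A}_0)}(k) = \sum_{l \in \mathcal{N}\setminus\{j\}} g_l^{(\mathcal{N}\setminus\{j\},\mathcal{A}_0)}(k-1)\, w_{l,j}$, for all $k > 0$

Proof. Note that 1(a) and 2(a) are obvious, since $\mathcal{D}_0 = \mathcal{A}_0$ is chosen deterministically. 1(b) follows from 1(a) and the observation that $\sum_{k=0}^{\infty} g_j^{(\mathcal{N},\mathcal{A}_0)}(k) \le 1$ by definition. For 2(b), since $\mathcal{A}_0 \subseteq \mathcal{N}\setminus\{j\}$,

$$g_j^{(\mathcal{N},\mathcal{A}_0)}(k) = P^{(\mathcal{N}\setminus\{j\},\mathcal{A}_0)}\big(b_j(\mathcal{A}_{k-2}) < \Theta_j \le b_j(\mathcal{A}_{k-1})\big)$$

Since $\mathcal{D}_{k-1} = \mathcal{A}_{k-1} \setminus \mathcal{A}_{k-2}$, and $\Theta_j$ is chosen uniformly from $[0,1]$, we can write,

$$g_j^{(\mathcal{N},\mathcal{A}_0)}(k) = E^{(\mathcal{N}\setminus\{j\},\mathcal{A}_0)}\big(b_j(\mathcal{D}_{k-1})\big)$$

$$= \sum_{l \in \mathcal{N}\setminus\{j\}} g_l^{(\mathcal{N}\setminus\{j\},\mathcal{A}_0)}(k-1)\, w_{l,j}$$

□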



3.1 Singleton Initial Set 

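Now, by Lemma 3.1 we can write, for $j \neq i$,

$$g_j^{(\mathcal{N},i)}(2) = \sum_{k_1 \in \mathcal{N}\setminus\{j\}} g_{k_1}^{(\mathcal{N}\setminus\{j\},i)}(1)\, w_{k_1,j} = \sum_{k_1 \in \mathcal{N}\setminus\{i,j\}} w_{i,k_1} w_{k_1,j}$$

$$g_j^{(\mathcal{N},i)}(3) = \sum_{k_1 \in \mathcal{N}\setminus\{i,j\}} g_{k_1}^{(\mathcal{N}\setminus\{j\},i)}(2)\, w_{k_1,j}$$

$$= \sum_{k_1 \in \mathcal{N}\setminus\{i,j\}} \sum_{k_2 \in \mathcal{N}\setminus\{i,j,k_1\}} g_{k_2}^{(\mathcal{N}\setminus\{j,k_1\},i)}(1)\, w_{k_2,k_1} w_{k_1,j}$$

$$= \sum_{k_1 \in \mathcal{N}\setminus\{i,j\}} \sum_{k_2 \in \mathcal{N}\setminus\{i,j,k_1\}} w_{i,k_2} w_{k_2,k_1} w_{k_1,j}$$

$$= \sum_{k_1 \in \mathcal{N}\setminus\{i,j\}} \sum_{k_2 \in \mathcal{N}\setminus\{i,j,k_1\}} w_{i,k_1} w_{k_1,k_2} w_{k_2,j}$$

In each of the steps, we substitute the expression for $g_j^{(\mathcal{N},i)}(k)$ and note that $g_i^{(\mathcal{N},i)}(k) = 0$ for $k > 0$. The last step is obtained by suitably rearranging the terms.

Note that the above terms can be understood as the influence of node $i$ reaching node $j$ through a path (without loops) of $k$ hops. We can use this to derive the recursive equation for $\sigma^{(\mathcal{N},i)}$. We have,

$$\sigma^{(\mathcal{N},i)} = \sum_{k=0}^{\infty} \sum_{j \in \mathcal{N}} g_j^{(\mathcal{N},i)}(k)$$

$$= 1 + \sum_{j \in \mathcal{N}\setminus\{i\}} g_j^{(\mathcal{N},i)}(1) + \sum_{j \in \mathcal{N}\setminus\{i\}} g_j^{(\mathcal{N},i)}(2) + \cdots$$

$$= 1 + \sum_{j \in \mathcal{N}\setminus\{i\}} w_{i,j} + \sum_{j \in \mathcal{N}\setminus\{i\}} \sum_{k_1 \in \mathcal{N}\setminus\{i,j\}} w_{i,k_1} w_{k_1,j} + \cdots$$

By changing variables and rearranging summations, this is equivalent to

$$\sigma^{(\mathcal{N},i)} = 1 + \sum_{k_1 \in \mathcal{N}\setminus\{i\}} w_{i,k_1} + \sum_{k_1 \in \mathcal{N}\setminus\{i\}} \sum_{j \in \mathcal{N}\setminus\{i,k_1\}} w_{i,k_1} w_{k_1,j} + \cdots$$

$$= 1 + \sum_{k_1 \in \mathcal{N}\setminus\{i\}} w_{i,k_1} \Big[ 1 + \sum_{k_2 \in \mathcal{N}\setminus\{i,k_1\}} w_{k_1,k_2} + \cdots \Big]$$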



Note that this equation is recursive in nature, and hence we 
can state the following theorem. 

Theorem 3.1. Given a social network $\mathcal{N}$, with influence matrix $W$, the total influence of any node $i$ in the network under the LT model is given by

$$\sigma^{(\mathcal{N},i)} = 1 + \sum_{j \in \mathcal{N}\setminus\{i\}} w_{i,j}\, \sigma^{(\mathcal{N}\setminus\{i\},\,j)} \qquad (2)$$

The equation says that under the linear threshold model, the total influence of any node $i$ in the network is one (for the node $i$ itself) plus the weighted sum of the influences of its neighbours in the network without $i$.
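
The recursion of Theorem 3.1 can be evaluated directly on small networks. The following Python sketch is illustrative only: the name `sigma_single` and the dict-of-dicts layout `weights[i][j]` for $w_{i,j}$ are assumptions made here, and the running time grows with the number of acyclic paths, so it is not meant for large graphs.

```python
def sigma_single(weights, nodes, i):
    """Exact sigma^{(N,i)} via Equation 2 on the node set `nodes` (a frozenset).

    weights[i][j] holds w_{i,j} (hypothetical representation for this sketch).
    """
    remaining = nodes - {i}
    total = 1.0  # contribution of node i itself
    for j in remaining:
        w = weights.get(i, {}).get(j, 0.0)
        if w > 0.0:
            # sigma^{(N \ {i}, j)}: influence of j in the network without i
            total += w * sigma_single(weights, remaining, j)
    return total


# Directed path 0 -> 1 -> 2 with both weights equal to 0.5.
path = {0: {1: 0.5}, 1: {2: 0.5}}
print(sigma_single(path, frozenset({0, 1, 2}), 0))  # 1 + 0.5 * (1 + 0.5) = 1.75
```

On the path example, node 1 is activated with probability 0.5 and node 2 with probability 0.25, which agrees with the value 1.75 returned by the recursion.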

3.2 Initial Set Ao 

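A similar derivation can be done for any $\mathcal{A}_0$. In this case, again using Lemma 3.1, we get

$$g_j^{(\mathcal{N},\mathcal{A}_0)}(1) = \sum_{i \in \mathcal{A}_0} w_{i,j}$$

$$g_j^{(\mathcal{N},\mathcal{A}_0)}(2) = \sum_{i \in \mathcal{A}_0} \sum_{k_1 \neq j,\, k_1 \notin \mathcal{A}_0} w_{i,k_1} w_{k_1,j}$$

$$g_j^{(\mathcal{N},\mathcal{A}_0)}(3) = \sum_{i \in \mathcal{A}_0} \sum_{k_1 \neq j,\, k_1 \notin \mathcal{A}_0} \; \sum_{k_2 \neq j, k_1,\, k_2 \notin \mathcal{A}_0} w_{i,k_1} w_{k_1,k_2} w_{k_2,j}$$

and so on. Note that these terms can be understood as the influence of nodes $i \in \mathcal{A}_0$ reaching node $j$ through a path (without loops) of $k$ steps, without passing through any other node in $\mathcal{A}_0$.

Also, having chosen $\mathcal{A}_0$, the edge weights $\{w_{i,j}, j \in \mathcal{A}_0\}$ do not have any effect on $\sigma^{(\mathcal{N},\mathcal{A}_0)}$. By the above two observations, we can thus divide the problem of finding $\sigma^{(\mathcal{N},\mathcal{A}_0)}$ into $K$ subproblems, where $K = |\mathcal{A}_0|$. Define sub-networks $\mathcal{N}_i^{\mathcal{A}_0}$, for all $i \in \mathcal{A}_0$, such that,

$$\mathcal{N}_i^{\mathcal{A}_0} = \{\mathcal{N}\setminus\mathcal{A}_0\} \cup \{i\}$$

Then we can see that,

$$\sigma^{(\mathcal{N},\mathcal{A}_0)} = \sum_{i \in \mathcal{A}_0} \sigma^{(\mathcal{N}_i^{\mathcal{A}_0},\, i)}$$

Now we can state the theorem as follows.

Theorem 3.2. Given a social network $\mathcal{N}$, with influence matrix $W$, the total influence of any initial set $\mathcal{A}_0$ in the network under the LT model is given by

$$\sigma^{(\mathcal{N},\mathcal{A}_0)} = \sum_{i \in \mathcal{A}_0} \sigma^{(\mathcal{N}_i^{\mathcal{A}_0},\, i)} \qquad (3)$$

Each of the terms on the right hand side can then be evaluated recursively using Equation 2.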
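
As an illustration of Theorem 3.2, the sketch below evaluates $\sigma^{(\mathcal{N},\mathcal{A}_0)}$ by summing single-node influences on the sub-networks $\mathcal{N}_i^{\mathcal{A}_0}$. The function names and the `weights[i][j]` representation of $w_{i,j}$ are assumptions of this sketch, and the approach is only practical for small networks.

```python
def sigma_single(weights, nodes, i):
    """Exact sigma^{(N,i)} via Equation 2; `nodes` is a frozenset."""
    remaining = nodes - {i}
    total = 1.0
    for j in remaining:
        w = weights.get(i, {}).get(j, 0.0)
        if w > 0.0:
            total += w * sigma_single(weights, remaining, j)
    return total


def sigma_set(weights, nodes, seeds):
    """Exact sigma^{(N,A_0)} via Equation 3.

    For each seed i, the sub-network N_i^{A_0} keeps the non-seed
    nodes together with i itself.
    """
    seeds = frozenset(seeds)
    others = frozenset(nodes) - seeds
    return sum(sigma_single(weights, others | {i}, i) for i in seeds)


# Directed path 0 -> 1 -> 2 with both weights equal to 0.5.
path = {0: {1: 0.5}, 1: {2: 0.5}}
print(sigma_set(path, {0, 1, 2}, {0, 2}))  # 1.5 + 1.0 = 2.5
```

With seeds {0, 2} on the path, only node 1 remains inactive initially and it is activated with probability 0.5, so the expected terminal size is 2.5, as the decomposition returns.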

4. INTERPRETATION VIA ACYCLIC PATH 
PROBABILITIES IN A DTMC 

To begin with, since $\sum_{i \neq j} w_{i,j} \le 1$, $W$ in general need not be a stochastic matrix. In order to interpret the expressions for $g_j^{(\mathcal{N},\mathcal{A}_0)}$ in the Discrete Time Markov Chain (DTMC) framework, we require $P = W^T$ to be row stochastic. Hence, we can set $w_{j,j} = 1 - \sum_{i \neq j} w_{i,j}$. Note that this does not affect the theory developed till now, since terms of the form $w_{i,i}$ do not feature in Equations 2 and 3. We shall also adopt a similar approach when we use the PageRank algorithm, where we will be calculating the stationary probability of a DTMC.



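From Lemma 3.1, for all $j \neq i$,

$$g_j^{(\mathcal{N},i)}(k) = \sum_{l_1 \in \mathcal{N}\setminus\{i,j\}} \; \sum_{l_2 \in \mathcal{N}\setminus\{i,j,l_1\}} \cdots \sum_{l_{k-1} \in \mathcal{N}\setminus\{i,j,l_1,\dots,l_{k-2}\}} w_{i,l_1} w_{l_1,l_2} \cdots w_{l_{k-1},j}$$

Writing $P = W^T$ and interpreting $P$ as a transition probability matrix for the DTMC $\{X_k\}$, obtained by reversing the edges in the social network, we have,

$$g_j^{(\mathcal{N},i)}(k) = \sum_{l_1} \sum_{l_2} \cdots \sum_{l_{k-1}} p_{j,l_1} p_{l_1,l_2} \cdots p_{l_{k-1},i}$$

$$= P\big(\{X_m \in \mathcal{N}\setminus\{i\}, 0 < m < k, X_k = i, X_0 \neq X_1 \neq \cdots \neq X_k\} \mid X_0 = j\big)$$

$$=: c_k(j \to i)$$

where $X_0 \neq X_1 \neq \cdots \neq X_k$ means that all states on the path are distinct, and similarly,

$$g_j^{(\mathcal{N},\mathcal{A}_0)}(k) = P\big(\{X_m \in \mathcal{N}\setminus\mathcal{A}_0, 0 < m < k, X_k \in \mathcal{A}_0, X_0 \neq X_1 \neq \cdots \neq X_k\} \mid X_0 = j\big)$$

$$=: c_k(j \to \mathcal{A}_0)$$

Define,

$$c(j \to i) = \sum_{k} c_k(j \to i), \qquad c(j \to \mathcal{A}_0) = \sum_{k} c_k(j \to \mathcal{A}_0)$$

In the DTMC represented by $P$, given we start from state $j$, $c(j \to i)$ denotes the probability of hitting state $i$ for the first time through an acyclic path from $j$, and $c(j \to \mathcal{A}_0)$ denotes the probability of hitting the set $\mathcal{A}_0$ for the first time through an acyclic path from $j$. Since $g_j^{(\mathcal{N},\mathcal{A}_0)} = c(j \to \mathcal{A}_0)$ we have,

$$\sigma^{(\mathcal{N},\mathcal{A}_0)} = |\mathcal{A}_0| + \sum_{j \in \mathcal{N}\setminus\mathcal{A}_0} c(j \to \mathcal{A}_0) \qquad (4)$$

Now the influence maximization problem can be restated as follows:

Let $W$ denote the influence matrix for the given problem. For the DTMC over the finite state space $\mathcal{N}$, with transition probability matrix $P = W^T$, choose $\mathcal{A} \subseteq \mathcal{N}$, $|\mathcal{A}| = K$, such that $\sum_{j \in \mathcal{A}^c} c(j \to \mathcal{A})$ is maximized. The set $\mathcal{A}$ thus obtained is the solution to the original influence maximization problem.

4.1 Properties of Acyclic path probabilities

We shall use the interpretation in terms of acyclic path probabilities in the DTMC to provide an alternate proof of submodularity of $\sigma^{(\mathcal{N},\mathcal{A}_0)}$ in the next section. In this subsection, we state and prove a few properties of such acyclic path probabilities. Let us first introduce a more elaborate notation.

$$c^{\mathcal{W}}(j \xrightarrow{v} \mathcal{D}) = P\Big(\bigcup_{k=0}^{\infty} \{X_m \in \mathcal{W}\setminus\mathcal{D}, 0 < m < k, X_k \in \mathcal{D}, X_0 \neq X_1 \neq \cdots \neq X_k, \exists\, 0 < u < k \text{ s.t. } X_u = v\} \,\Big|\, X_0 = j\Big)$$

$$= \sum_{k=0}^{\infty} P\big(\{X_m \in \mathcal{W}\setminus\mathcal{D}, 0 < m < k, X_k \in \mathcal{D}, X_0 \neq X_1 \neq \cdots \neq X_k, \exists\, 0 < u < k \text{ s.t. } X_u = v\} \mid X_0 = j\big) \qquad (5)$$

Here $c^{\mathcal{W}}(j \xrightarrow{v} \mathcal{D})$ is the probability in the DTMC of reaching the set $\mathcal{D}$ for the first time, through an acyclic path via node $v$, using nodes only from $\mathcal{W}$ as intermediate nodes, given that we start from node $j$. If $\mathcal{W} = \mathcal{N}$, we do not explicitly mention it in the notation. Also, $v$ is an optional argument, which when omitted, removes the constraint that $X_u = v$ for some $0 < u < k$. In all useful cases that we consider, we assume $v, j \notin \mathcal{D}$ and also $v, j \in \mathcal{W}$.

Some properties of acyclic path probabilities are as follows:

Lemma 4.1. $c(j \to \mathcal{A}) = 1$, for all $j \in \mathcal{A}$

Proof.

$$c(j \to \mathcal{A}) = \sum_{k=0}^{\infty} P\big(\{X_m \in \mathcal{N}\setminus\mathcal{A}, 0 \le m < k\}, X_k \in \mathcal{A}, X_0 \neq \cdots \neq X_k \mid X_0 = j\big)$$

$$= P(X_0 \in \mathcal{A} \mid X_0 = j) + \sum_{k=1}^{\infty} P\big(X_0 \in \mathcal{N}\setminus\mathcal{A}, \{X_m \in \mathcal{N}\setminus\mathcal{A}, 0 < m < k\}, X_k \in \mathcal{A}, X_0 \neq X_1 \neq \cdots \neq X_k \mid X_0 = j\big)$$

Since $j \in \mathcal{A}$, all the terms in the summation are zero, and hence, $c(j \to \mathcal{A}) = P(X_0 \in \mathcal{A} \mid X_0 = j) = 1$ □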
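
The acyclic path interpretation behind Equation 4 can be checked numerically on small examples. The sketch below is illustrative only: it builds the reversed chain $P = W^T$ with the self-transition convention $p_{j,j} = 1 - \sum_{i \neq j} w_{i,j}$, enumerates acyclic paths recursively, and sums the hitting probabilities. The names `reversed_chain`, `acyclic_hit_prob` and `influence_via_dtmc`, and the dict-of-dicts representation, are assumptions of this sketch.

```python
def reversed_chain(weights, nodes):
    """Transition probabilities P = W^T, with self-loops absorbing leftover mass."""
    P = {j: {} for j in nodes}
    for i, out_edges in weights.items():
        for j, w in out_edges.items():
            P[j][i] = w  # p_{j,i} = w_{i,j}
    for j in nodes:
        P[j][j] = 1.0 - sum(P[j].values())
    return P


def acyclic_hit_prob(P, j, target, visited=frozenset()):
    """c(j -> target): probability of first hitting `target` along an acyclic path."""
    if j in target:
        return 1.0
    visited = visited | {j}
    total = 0.0
    for nxt, p in P.get(j, {}).items():
        if nxt in visited:
            continue  # revisiting a state (including a self-loop) is not acyclic
        total += p * acyclic_hit_prob(P, nxt, target, visited)
    return total


def influence_via_dtmc(weights, nodes, seeds):
    """Equation 4: |A_0| plus the acyclic hitting probabilities of the other nodes."""
    seeds = frozenset(seeds)
    P = reversed_chain(weights, nodes)
    return len(seeds) + sum(acyclic_hit_prob(P, j, seeds)
                            for j in nodes if j not in seeds)


# Directed path 0 -> 1 -> 2 with both weights equal to 0.5.
path = {0: {1: 0.5}, 1: {2: 0.5}}
print(influence_via_dtmc(path, {0, 1, 2}, {0}))  # 1 + 0.5 + 0.25 = 1.75
```

On this example the value agrees with the one obtained from the recursion of Theorem 3.1 for node 0.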



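Lemma 4.2. $c(j \to \mathcal{A}) = c(j \xrightarrow{v} \mathcal{A}) + c^{\mathcal{N}\setminus\{v\}}(j \to \mathcal{A})$

This property means that the probability of reaching $\mathcal{A}$ starting from $j$ through an acyclic path can be split into the probability of those paths via node $v$ and those that avoid node $v$.

Proof.

$$c(j \to \mathcal{A}) = \sum_{k=0}^{\infty} P\big(\{X_m \in \mathcal{N}\setminus\mathcal{A}, 0 < m < k, X_k \in \mathcal{A}, X_0 \neq X_1 \neq \cdots \neq X_k\} \mid X_0 = j\big)$$

$$= \sum_{k=0}^{\infty} P\big(\{X_m \in \mathcal{N}\setminus\mathcal{A}, 0 < m < k, X_k \in \mathcal{A}, X_0 \neq X_1 \neq \cdots \neq X_k, \exists\, 0 < u < k \text{ s.t. } X_u = v\} \mid X_0 = j\big)$$

$$\quad + \sum_{k=0}^{\infty} P\big(\{X_m \in \mathcal{N}\setminus\mathcal{A}, 0 < m < k, X_k \in \mathcal{A}, X_0 \neq X_1 \neq \cdots \neq X_k, \nexists\, 0 < u < k \text{ s.t. } X_u = v\} \mid X_0 = j\big)$$

$$= c(j \xrightarrow{v} \mathcal{A}) + \sum_{k=0}^{\infty} P\big(\{X_m \in (\mathcal{N}\setminus\{v\})\setminus\mathcal{A}, 0 < m < k, X_k \in \mathcal{A}\setminus\{v\}, X_0 \neq X_1 \neq \cdots \neq X_k\} \mid X_0 = j\big)$$

Since $v \notin \mathcal{A}$, $\mathcal{A}\setminus\{v\} = \mathcal{A}$ and we get,

$$c(j \to \mathcal{A}) = c(j \xrightarrow{v} \mathcal{A}) + c^{\mathcal{N}\setminus\{v\}}(j \to \mathcal{A})$$ □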



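Lemma 4.3. $c(j \to \mathcal{A} \cup \{v\}) = c^{\mathcal{N}\setminus\mathcal{A}}(j \to v) + c^{\mathcal{N}\setminus\{v\}}(j \to \mathcal{A})$, for all $v \notin \mathcal{A}$

This property means that the probability of reaching $\mathcal{A} \cup \{v\}$ through an acyclic path can be split into the probability of reaching $\mathcal{A}$ avoiding node $v$, and that of reaching node $v$ avoiding set $\mathcal{A}$.

Proof.

$$c(j \to \mathcal{A} \cup \{v\}) = \sum_{k=0}^{\infty} P\big(\{X_m \in \mathcal{N}\setminus(\mathcal{A} \cup \{v\}), 0 < m < k, X_k \in \mathcal{A} \cup \{v\}, X_0 \neq X_1 \neq \cdots \neq X_k\} \mid X_0 = j\big)$$

$$= \sum_{k=0}^{\infty} P\big(\{X_m \in (\mathcal{N}\setminus\mathcal{A})\setminus\{v\}, 0 < m < k, X_k = v, X_0 \neq X_1 \neq \cdots \neq X_k\} \mid X_0 = j\big)$$

$$\quad + \sum_{k=0}^{\infty} P\big(\{X_m \in (\mathcal{N}\setminus\{v\})\setminus\mathcal{A}, 0 < m < k, X_k \in \mathcal{A}, X_0 \neq X_1 \neq \cdots \neq X_k\} \mid X_0 = j\big)$$

$$= c^{\mathcal{N}\setminus\mathcal{A}}(j \to v) + c^{\mathcal{N}\setminus\{v\}}(j \to \mathcal{A})$$

The second equality results from the assumption that $v \notin \mathcal{A}$. Note that this property can be extended to $c(j \to \mathcal{A} \cup \mathcal{B})$, where $\mathcal{A}$ and $\mathcal{B}$ are any two disjoint sets. □

Before stating and proving the next property, we wish to prove the following two sublemmas.

Sublemma 4.1.

$$c(j \to v) = \sum_{L' \subseteq \mathcal{N} - \{j,v\}} \; \sum_{L \in \Pi(L')} p_{j,v}(L)$$

where $\Pi(L')$ is the set of all permutations of $L'$, and

$$p_{j,v}(L) = \begin{cases} p_{j,v} & \text{if } L = \emptyset \\ p_{j,l_1} p_{l_1,l_2} \cdots p_{l_{k-1},v} & \text{if } L = \{l_1, \dots, l_{k-1}\} \end{cases} \qquad (6)$$

Proof.

$$c(j \to v) = \sum_{k=1}^{\infty} P\big(X_1 = l_1, X_2 = l_2, \dots, X_{k-1} = l_{k-1}, X_k = v, \; l_1 \neq l_2 \neq \cdots \neq l_{k-1} \neq j \neq v \mid X_0 = j\big)$$

$$= \sum_{L' \subseteq \mathcal{N} - \{j,v\}} \; \sum_{L \in \Pi(L')} p_{j,v}(L)$$

The second equality is by using the chain rule and the Markov property of $X_k$. □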



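Sublemma 4.2.

$$c(j \xrightarrow{v} \mathcal{A}) = \sum_{L' \subseteq \mathcal{N}\setminus\mathcal{A} - \{j,v\}} \; \sum_{L \in \Pi(L')} p_{j,v}(L)\, c^{\mathcal{N}\setminus(L' \cup \{j\})}(v \to \mathcal{A})$$

Proof.

$$c(j \xrightarrow{v} \mathcal{A}) = \sum_{k=0}^{\infty} P\big(\{X_m \in \mathcal{N}\setminus\mathcal{A}, 0 < m < k, X_k \in \mathcal{A}, X_0 \neq X_1 \neq \cdots \neq X_k, \exists\, 0 < u < k \text{ s.t. } X_u = v\} \mid X_0 = j\big)$$

$$= \sum_{k=0}^{\infty} \sum_{L' \subseteq \mathcal{N}\setminus\mathcal{A}} \sum_{u=1}^{k-1} P\big(X_1 = l_1, \dots, X_{u-1} = l_{u-1}, X_u = v, \; l_i \notin \mathcal{A}, \{X_m \in \mathcal{N}\setminus\mathcal{A}, u < m < k\}, X_k \in \mathcal{A}, X_0 \neq X_1 \neq \cdots \neq X_k \mid X_0 = j\big)$$

$$= \sum_{k=0}^{\infty} \sum_{L' \subseteq \mathcal{N}\setminus\mathcal{A}} \sum_{u=1}^{k-1} P\big(X_1 = l_1, \dots, X_{u-1} = l_{u-1}, X_u = v, \; l_1 \neq \cdots \neq l_{u-1}, l_i \neq v \neq j, l_i \notin \mathcal{A}, \{X_m \in (\mathcal{N}\setminus\{L' \cup j\})\setminus\mathcal{A}, u < m < k\}, X_k \in \mathcal{A} \mid X_0 = j\big)$$

$$= \sum_{k=0}^{\infty} \sum_{L' \subseteq \mathcal{N}\setminus\mathcal{A}} \sum_{u=1}^{k-1} P\big(X_1 = l_1, \dots, X_{u-1} = l_{u-1}, X_u = v, \; l_1 \neq \cdots \neq l_{u-1}, l_i \neq v \neq j, l_i \notin \mathcal{A} \mid X_0 = j\big) \times P\big(\{X_m \in (\mathcal{N}\setminus\{L' \cup j\})\setminus\mathcal{A}, u < m < k\}, X_k \in \mathcal{A} \mid X_u = v\big)$$

$$= \sum_{L' \subseteq \mathcal{N} - (\mathcal{A} \cup \{j,v\})} \; \sum_{L \in \Pi(L')} p_{j,v}(L)\, c^{\mathcal{N}\setminus(L' \cup \{j\})}(v \to \mathcal{A})$$

where $p_{j,v}(L)$ is defined as above in Equation 6 and $L' = \{l_1, \dots, l_{u-1}\}$. The penultimate equality is by applying the Markov property at $X_u = v$. □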



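Now we can state and prove the next property.

Lemma 4.4. $c(j \xrightarrow{v} \mathcal{A}) \le c^{\mathcal{N}\setminus\mathcal{A}}(j \to v)$

Proof.

$$c(j \xrightarrow{v} \mathcal{A}) = \sum_{L' \subseteq \mathcal{N} - (\mathcal{A} \cup \{j,v\})} \; \sum_{L \in \Pi(L')} p_{j,v}(L)\, c^{\mathcal{N}\setminus(L' \cup \{j\})}(v \to \mathcal{A})$$

$$\le \sum_{L' \subseteq \mathcal{N} - (\mathcal{A} \cup \{j,v\})} \; \sum_{L \in \Pi(L')} p_{j,v}(L)$$

$$= c^{\mathcal{N}\setminus\mathcal{A}}(j \to v)$$

In the above proof the equalities are due to Sublemmas 4.1 and 4.2. The inequality is because $c^{\mathcal{N}\setminus(L' \cup \{j\})}(v \to \mathcal{A})$, being a probability term, is at most 1. □

Lemma 4.5. For $\mathcal{A} \subseteq \mathcal{B}$, $c^{\mathcal{N}\setminus\mathcal{A}}(j \to v) \ge c^{\mathcal{N}\setminus\mathcal{B}}(j \to v)$

Proof. Note that,

$$c^{\mathcal{N}\setminus\mathcal{A}}(j \to v) = \sum_{L' \subseteq \mathcal{N} - (\mathcal{A} \cup \{j,v\})} \; \sum_{L \in \Pi(L')} p_{j,v}(L) \qquad (7)$$

In the expression on the right, as $\mathcal{A}$ increases, the number of possible $L'$ decreases, hence $c^{\mathcal{N}\setminus\mathcal{A}}(j \to v)$ decreases. □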

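5. SUBMODULARITY OF $\sigma^{(\mathcal{N},\mathcal{A}_0)}$

In this section we shall prove the submodularity and monotonicity of acyclic path probabilities and, using them, we prove the monotonicity and submodularity of $\sigma^{(\mathcal{N},\mathcal{A}_0)}$.

Lemma 5.1. $c(j \to \mathcal{A})$ is monotonically increasing in $\mathcal{A}$, i.e., for $\mathcal{A} \subseteq \mathcal{B}$, $c(j \to \mathcal{A}) \le c(j \to \mathcal{B})$

Proof. We need to check,

$$c(j \to \mathcal{A} \cup \{v\}) - c(j \to \mathcal{A}) \ge 0$$

Substituting for $c(j \to \mathcal{A} \cup \{v\})$ and $c(j \to \mathcal{A})$ from Lemmas 4.2 and 4.3 we have

$$c(j \to \mathcal{A} \cup \{v\}) - c(j \to \mathcal{A}) = c^{\mathcal{N}\setminus\mathcal{A}}(j \to v) - c(j \xrightarrow{v} \mathcal{A}) \ge 0$$

The last inequality results from Lemma 4.4. □

Lemma 5.2. $c(j \to \mathcal{A})$ is submodular in $\mathcal{A}$, i.e., $c(j \to \mathcal{A} \cup \{v\}) - c(j \to \mathcal{A})$ decreases as $\mathcal{A}$ increases.

Proof.

$$c(j \to \mathcal{A} \cup \{v\}) - c(j \to \mathcal{A}) = c^{\mathcal{N}\setminus\mathcal{A}}(j \to v) - c(j \xrightarrow{v} \mathcal{A})$$

$$= \sum_{L' \subseteq \mathcal{N} - (\mathcal{A} \cup \{j,v\})} \; \sum_{L \in \Pi(L')} p_{j,v}(L) \; - \sum_{L' \subseteq \mathcal{N} - (\mathcal{A} \cup \{j,v\})} \; \sum_{L \in \Pi(L')} p_{j,v}(L)\, c^{\mathcal{N}\setminus(L' \cup \{j\})}(v \to \mathcal{A})$$

$$= \sum_{L' \subseteq \mathcal{N} - (\mathcal{A} \cup \{j,v\})} \; \sum_{L \in \Pi(L')} p_{j,v}(L) \Big[ 1 - c^{\mathcal{N}\setminus(L' \cup \{j\})}(v \to \mathcal{A}) \Big]$$

For a fixed $L'$, as $\mathcal{A}$ increases, $c^{\mathcal{N}\setminus(L' \cup \{j\})}(v \to \mathcal{A})$ increases by Lemma 5.1, and hence the entire term inside the summation decreases as $\mathcal{A}$ increases. Also, as $\mathcal{A}$ increases, the number of $L'$ that satisfy the constraint in the first summation also decreases. Hence, $c(j \to \mathcal{A} \cup \{v\}) - c(j \to \mathcal{A})$ decreases as $\mathcal{A}$ increases. Thus $c(j \to \mathcal{A})$ is submodular. □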

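5.1 Monotonicity and Submodularity of $\sigma^{(\mathcal{N},\mathcal{A}_0)}$

Recalling Equation 4,

$$\sigma^{(\mathcal{N},\mathcal{A}_0)} = |\mathcal{A}_0| + \sum_{j \in \mathcal{N}\setminus\mathcal{A}_0} c(j \to \mathcal{A}_0)$$

Since $c(j \to \mathcal{A}) = 1$ for all $j \in \mathcal{A}$ by Lemma 4.1, we can write

$$\sigma^{(\mathcal{N},\mathcal{A}_0)} = \sum_{j \in \mathcal{N}} c(j \to \mathcal{A}_0)$$

$c(j \to \mathcal{A}_0)$ is monotonically increasing and submodular in $\mathcal{A}_0$ by Lemmas 5.1 and 5.2, and since $\sigma^{(\mathcal{N},\mathcal{A}_0)}$ is a non-negative linear combination of such $c(j \to \mathcal{A}_0)$, this automatically proves the monotonicity and submodularity of $\sigma^{(\mathcal{N},\mathcal{A}_0)}$.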
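
Monotonicity and submodularity can also be checked by brute force on a small random instance. The following Python sketch is an illustration under stated assumptions rather than part of the proof: influence is evaluated exactly through Equations 2 and 3, the weights are generated by the hypothetical helper `random_lt_weights` so that the incoming weights of each node sum to at most 1, and all function names are introduced here only for the example.

```python
import itertools
import random


def sigma_single(weights, nodes, i):
    """Exact sigma^{(N,i)} via Equation 2; `nodes` is a frozenset."""
    remaining = nodes - {i}
    total = 1.0
    for j in remaining:
        w = weights.get(i, {}).get(j, 0.0)
        if w > 0.0:
            total += w * sigma_single(weights, remaining, j)
    return total


def sigma_set(weights, nodes, seeds):
    """Exact sigma^{(N,A_0)} via Equation 3 (0 for the empty seed set)."""
    others = nodes - seeds
    return sum(sigma_single(weights, others | {i}, i) for i in seeds)


def random_lt_weights(n, rng):
    """Complete graph weights with incoming weights of each node summing to < 1."""
    weights = {i: {} for i in range(n)}
    for j in range(n):
        raw = {i: rng.random() for i in range(n) if i != j}
        scale = rng.random() / sum(raw.values())
        for i, r in raw.items():
            weights[i][j] = r * scale
    return weights


def check_monotone_submodular(weights, n, tol=1e-9):
    """Check sigma(A) <= sigma(B) and diminishing marginal gains for all A <= B."""
    nodes = frozenset(range(n))
    subsets = [frozenset(c) for r in range(n + 1)
               for c in itertools.combinations(range(n), r)]
    sigma = {s: sigma_set(weights, nodes, s) for s in subsets}
    for a in subsets:
        for b in subsets:
            if not a <= b:
                continue
            if sigma[a] > sigma[b] + tol:
                return False
            for v in nodes - b:
                gain_a = sigma[a | {v}] - sigma[a]
                gain_b = sigma[b | {v}] - sigma[b]
                if gain_a < gain_b - tol:
                    return False
    return True


print(check_monotone_submodular(random_lt_weights(5, random.Random(1)), 5))
```

The check enumerates all pairs of nested subsets, so it is feasible only for a handful of nodes; it is meant as a sanity check of the statement, not as an algorithm.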



6. EXAMPLES 

In this section, we use the analytical expression to obtain 
the optimal initial set for some simple LT models. We also 
show that PageRank matches with the optimal solution ob- 
tained from the analytical expression in 2 cases. We pro- 
vide simulation results for those cases in which PageRank is 
not optimal, but provides a very good approximation of the 
greedy solution. 



6.1 UISLT Models on a Complete Graph

We introduce a simple version of the Linear Threshold model, called the Uniform Influence-Susceptance Linear Threshold model (UISLT). In this model, we have two parameters $\alpha_i$ and $\beta_i$ associated with the node $i$, measuring the level of influence and susceptance of the node $i$ respectively. The social network is a complete graph with the matrix $W$ defined as follows: for all $i, j$ with $i \neq j$,

$$w_{i,j} = \alpha_i \times \beta_j$$

and

$$w_{i,i} = 1 - \sum_{j \neq i} w_{j,i}$$

Note that the $\alpha_i$'s and $\beta_i$'s are chosen such that $\sum_{j \neq i} w_{j,i} \le 1$. This implies that, for all $i$,

$$\sum_{j \neq i} \alpha_j \le \frac{1}{\beta_i}$$

Theorem 6.1. Let $\mathcal{N} = \mathcal{A}_0 \cup \{j_1, \dots, j_k\}$, where $K = |\mathcal{A}_0|$ and $k = |\mathcal{N}| - K$. Then the total influence of the initial set $\mathcal{A}_0$ in the UISLT model is given by

$$\sigma^{(\mathcal{N},\mathcal{A}_0)} = |\mathcal{A}_0| + \alpha_{\mathcal{A}_0} \sum_{m=0}^{k-1} h^m(\alpha, \beta)$$

where $\alpha = \{\alpha_{j_1}, \alpha_{j_2}, \dots, \alpha_{j_k}\}$, $\beta = \{\beta_{j_1}, \beta_{j_2}, \dots, \beta_{j_k}\}$, $\alpha_{\mathcal{A}_0} = \sum_{t \in \mathcal{A}_0} \alpha_t$, and

$$h^m(\alpha, \beta) = \sum_{\{l_1, \dots, l_{m+1}\} \subseteq \{1, \dots, k\}} \beta_{j_{l_1}} \times \cdots \times \beta_{j_{l_{m+1}}} \; f^m(\alpha_{j_{l_1}}, \dots, \alpha_{j_{l_{m+1}}})$$

where,

$$f^0(x_1, \dots, x_t) = 1$$

and for all $m > 0$,

$$f^m(x_1, \dots, x_t) = m! \sum_{\{y_1, \dots, y_m\} \subseteq \{x_1, \dots, x_t\}} y_1 \times \cdots \times y_m$$

Proof. The proof is by direct application of Equations 2 and 3. □

USLT model. For the Uniform Susceptance model, from the above general equation for UISLT, by setting $\alpha = (1, \dots, 1)$ and noting that,

$$f^m(\alpha_{j_{l_1}}, \dots, \alpha_{j_{l_{m+1}}}) = (m+1)!$$

we get,

$$\sigma^{(\mathcal{N},\mathcal{A}_0)} = |\mathcal{A}_0| + |\mathcal{A}_0| \sum_{m=0}^{k-1} f^{m+1}(\beta)$$

where $f^m$ is the same as defined earlier.

Hence in this case $\sigma^{(\mathcal{N},\mathcal{A}_0)}$ is an increasing function of $\beta$. Thus to maximize $\sigma^{(\mathcal{N},\mathcal{A}_0)}$, we need to pick $\mathcal{A}_0$ such that the nodes with maximum $\beta_i$ are left out. Thus the optimal $\mathcal{A}_0$ in this case is the set of $K$ nodes with the least $\beta_i$ values. This also makes intuitive sense, since by picking this $\mathcal{A}_0$ (the least susceptible nodes), we ensure that the inactive nodes are the most susceptible ones, hence maximizing the expected cardinality of the terminal set.

It turns out that when we apply the PageRank algorithm, we get the stationary probability as,

$$\pi_i = \frac{1/\beta_i}{\sum_{j} 1/\beta_j}$$

Thus choosing the nodes with the top-$K$ $\pi_i$ yields the same optimal set, i.e., the set of $K$ nodes with the least $\beta_i$ values.

UILT model. For the Uniform Influence model, from the above general equation for UISLT, by setting $\beta = (1, \dots, 1)$, we get,

$$h^m(\alpha) = \sum_{\{l_1, \dots, l_{m+1}\} \subseteq \{1, \dots, k\}} f^m(\alpha_{j_{l_1}}, \dots, \alpha_{j_{l_{m+1}}})$$

Hence,

$$h^m(\alpha) = (k - m)\, f^m(\alpha_{j_1}, \dots, \alpha_{j_k})$$

$$\sigma^{(\mathcal{N},\mathcal{A}_0)} = |\mathcal{A}_0| + \alpha_{\mathcal{A}_0} \sum_{m=0}^{k-1} (k - m)\, f^m(\alpha_{j_1}, \dots, \alpha_{j_k})$$

Hence in this case $\sigma^{(\mathcal{N},\mathcal{A}_0)}$ is an increasing function of $\alpha_{\mathcal{A}_0}$, fixing $|\mathcal{A}_0|$ to be $K$. Thus to maximize $\sigma^{(\mathcal{N},\mathcal{A}_0)}$, we need to pick $\mathcal{A}_0$ such that $\alpha_{\mathcal{A}_0}$ is maximized. Thus the optimal $\mathcal{A}_0$ in this case is the set of $K$ nodes with the highest $\alpha_i$ values. This also makes intuitive sense, since $\alpha_i$ is a measure of influence, and the $\mathcal{A}_0$ thus obtained would be the set of most influential nodes.

It turns out that when we apply the PageRank algorithm, we get the stationary probability as,

$$\pi_i = \frac{\alpha_i}{\sum_{j} \alpha_j}$$

Thus choosing the nodes with the top-$K$ $\pi_i$ yields the same optimal set, i.e., the set of $K$ nodes with the highest $\alpha_i$ values.

6.1.1 UISLT model and PageRank 

But in general, the PageRank algorithm need not be optimal for the UISLT case. For PageRank, in the UISLT case, the stationary probability is given by,

$$\pi_i = \frac{\alpha_i / \beta_i}{\sum_{j} \alpha_j / \beta_j}$$

Hence one might suspect that picking the nodes in decreasing order of $\alpha_i / \beta_i$ could be optimal. But this turns out to be false, since a node with $\beta_i$ very close to zero could get chosen as the most influential node irrespective of its $\alpha_i$. If we restrict the $\beta_i$'s, by not allowing them to vary much, we see that the PageRank algorithm gives a very good approximation of the greedy solution.

The following simulation was conducted on a complete graph
with 50 nodes with the UISLT model. The $\alpha_i$'s were picked
at random from a uniform distribution over [0, 1] and the $\beta_i$'s
were picked with a uniform distribution over [ . * 0,5 — , - 

It is found that PageRank performs on par with the greedy algorithm. Results are shown in Figure 1.

Figure 1: UISLT on a complete influence graph of 50 nodes, with $\alpha_i$'s and $\beta_i$'s being picked as described in Section 6.1.1

6.2 Node Degree based Model 

In this class of models, we start with an undirected graph without self-loops, whose adjacency matrix is given by $A$. We then generate the influence matrix $W$ by normalizing the adjacency matrix as follows:

$$w_{i,j} = \frac{a_{i,j}}{d_j} \qquad (8)$$

where $d_j = \sum_i a_{i,j}$ is the degree of the node $j$.

Let us restrict our attention to acyclic graphs. We then have the following theorem.

Theorem 6.2. Consider an acyclic undirected graph $\mathcal{N}$ represented by the adjacency matrix $A$. Let the influence matrix be generated by Equation 8. Then, for any node $i \in \mathcal{N}$,

$$\sigma^{(\mathcal{N},i)} = d_i + 1$$

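Proof. Given the acyclic graph $\mathcal{N}$ and the node $i$, view the graph as a tree $T$ of depth $D$, with node $i$ as the root. For any node $j$ in the tree $T$, let $P(j)$ be the parent of node $j$ in $T$, and $C(j)$ be the set of its immediate child nodes. Define,

$$L_0 = \{j \in T : C(j) = \emptyset\} \neq \emptyset$$

and for $0 < k \le D$,

$$L_k = \Big\{j \in T : C(j) \subseteq \bigcup_{t=0}^{k-1} L_t, \; C(j) \cap L_{k-1} \neq \emptyset\Big\}$$

Hence by definition of depth $D$ we have $L_D = \{i\}$, and it is easy to see that the $L_k$'s partition the nodes in $\mathcal{N}$ into sets of nodes having the same depth.

By Equation 2 we have,

$$\sigma^{(\mathcal{N},i)} = 1 + \sum_{j \in C(i)} w_{i,j}\, \sigma^{(\mathcal{N}\setminus P(j),\, j)} \qquad (9)$$

where $P(j) = i$ and $j \in L_0 \cup \cdots \cup L_{D-1}$.

We shall prove inductively that, $\forall j$,

$$\sigma^{(\mathcal{N}\setminus P(j),\, j)} = |C(j)| + 1 \qquad (10)$$

We know that if $j \in L_0$, then it is true, since in that case $\sigma^{(\mathcal{N}\setminus P(j),\, j)} = 1$ and $C(j) = \emptyset$. Assume that the claim is valid for $j \in L_0, L_1, \dots, L_k$.

For $j \in L_{k+1}$,

$$\sigma^{(\mathcal{N}\setminus P(j),\, j)} = 1 + \sum_{l \in C(j)} w_{j,l}\, \sigma^{(\mathcal{N}\setminus P(l),\, l)}$$

$$= 1 + \sum_{l \in C(j)} \frac{|C(l)| + 1}{|C(l)| + 1}$$

$$= 1 + |C(j)|$$

The second equality in the above set of equations holds because Claim 10 is valid for $l \in \bigcup_{m=0}^{k} L_m$ and $w_{j,l} = 1/d_l = 1/(|C(l)| + 1)$. Thus substituting the above in Equation 9 we have

$$\sigma^{(\mathcal{N},i)} = |C(i)| + 1 = d_i + 1$$

□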
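
Theorem 6.2 can be checked on a small tree by combining the degree normalization of Equation 8 with the recursion of Equation 2. The sketch below is illustrative: the example tree, the helper name `degree_weights` and the dict-of-dicts representation are assumptions made for this example.

```python
def degree_weights(edges):
    """w_{i,j} = a_{i,j} / d_j for an undirected edge list (Equation 8)."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    degrees = {i: len(nbrs) for i, nbrs in neighbours.items()}
    weights = {i: {j: 1.0 / degrees[j] for j in nbrs}
               for i, nbrs in neighbours.items()}
    return weights, degrees


def sigma_single(weights, nodes, i):
    """Exact sigma^{(N,i)} via Equation 2; `nodes` is a frozenset."""
    remaining = nodes - {i}
    total = 1.0
    for j in remaining:
        w = weights.get(i, {}).get(j, 0.0)
        if w > 0.0:
            total += w * sigma_single(weights, remaining, j)
    return total


# A small tree: 0 is joined to 1 and 2, 1 to 3 and 4, and 2 to 5.
tree_edges = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5)]
weights, degrees = degree_weights(tree_edges)
nodes = frozenset(degrees)
for i in sorted(nodes):
    # Each line shows the node, its influence, and d_i + 1 for comparison.
    print(i, round(sigma_single(weights, nodes, i), 6), degrees[i] + 1)
```

For every node of the example tree the computed influence agrees with $d_i + 1$ up to floating point rounding.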



Thus it is found that the most influential node is the node with the highest degree. In order to pick the second node for the greedy algorithm we need to maximize $\sigma^{(\mathcal{N},\{i,j\})}$ over $j$, where $i$ is the node with the highest degree. By using Equations 2 and 3 we can write,



r {M\j,i) 



+ Wi 



(Af\i,i) 



From the above expressions, we find that if $i$ and $j$ are high degree nodes, their net influence can be approximated well by the sum of their individual influences, since $w_{i,j}$ and $w_{j,i}$ are small. Extending this further, we see that picking the high degree nodes gives a very good approximation of the greedy solution.

By applying the PageRank algorithm using P = W T as 
the transition probability matrix, where W is as defined in 
Equation [5] we get the stationary probability to be, 



We also tested PageRank algorithm on undirected graphs 
which may have cycles. It turns out that there is still a 
high correlation (see Figure [5]) between the degree of a node 
and its individual influence. Hence even in this case we can 
pick nodes in the decreasing order of degree to get a good 
estimate of the optimal initial set. 
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Figure 2: Scattergram between degree and influence 
of a node under the degree-based influence model in 
an Erdos-Renyi random graph of 100 nodes 



7. IMPROVING THE GREEDY ALGORITHM 

As shown by Kempe et al. the greedy hill-climbing solution 
achieves (1 — 1/e) approximate solution for the influence 
maximization problem [8]. But finding the initial set us- 
ing the greedy algorithm is computationally quite expensive. 
There have been several efforts in the literature to improve 
the execution speed of the greedy algorithm. In this section, 
based on the insights obtained from the analytical expres- 
sions, we provide two techniques, namely thresholding and 
restriction, which achieve a very close approximation to the 
greedy solution. 

During the greedy algorithm, the first stage involves computation
of $\sigma^{(\mathcal{N}, i)}$ for all nodes $i \in \mathcal{N}$. One can rank the nodes
in decreasing order of individual influence to yield a rank
list, which we shall refer to as the G1 list. One possible solution
for influence maximization is to pick the top K nodes to be
the initial set. This solution is not as effective as the greedy
solution, because it may contain dummy nodes, i.e., nodes
which, in spite of having a high individual influence, fail to
provide high marginal contributions to the greedy solution
set and hence get rejected by the greedy algorithm.

We classify the dummy nodes into two categories, namely
leechers and subordinates. Let X denote the set of nodes
in G1 which are above i and have been picked to be part of
the initial set. Then a node i is a leecher to set X, if



i ^2 w z>j a 



(AA\ij) 



and we find that PageRank algorithm matches with the 
heuristic of choosing the nodes according to degree and hence 
will give a good approximation of the greedy solution. 



This means that node i will have almost no marginal
contribution to a set which already contains X, since it
derives almost all its influence from those nodes.

We define a node i to be an $\alpha$-subordinate of set X if

$$g_i^{(\mathcal{N}, X)} > \alpha$$

This means that node i gets activated at least a fraction $\alpha$
of the time when we begin with X as the initial set. Thus,
for high enough $\alpha$, the marginal contribution of this node to
the set X will be much smaller than its individual influence.
If one can identify and eliminate such leechers and subordinates
from G1 while picking the initial set, then a more
effective initial set could be obtained.

We use two techniques, namely thresholding and restriction,
to filter out the subordinate nodes and leechers respectively.
Thresholding involves comparing $g_i^{(\mathcal{N}, X)}$ with $\alpha$, and
restriction involves evaluating $\sigma^{(\mathcal{N} \setminus X, i)}$ to pick the next best
node. One can also choose to use only one of these techniques,
with slightly reduced effectiveness.

7.1 G1-Sieving Algorithm

In this algorithm, we first evaluate $\sigma^{(\mathcal{N}, i)}$ for $i \in \mathcal{N}$ and
obtain the G1 list. We start with X containing the top node of
the G1 list. We remove the nodes which are $\alpha$-subordinates of
set X by evaluating $g_i^{(\mathcal{N}, X)}$ for all nodes and comparing with
the threshold $\alpha$. For the remaining nodes, we compute
$\sigma^{(\mathcal{N} \setminus X, i)}$ and pick the node that has the highest value. One can
also discard the nodes which have a value very close to zero,
since these nodes are the leechers. We add the picked node to X
and repeat the procedure until we have a set of size K or we
exhaust the entire list. The algorithm is shown in Algorithm 2.
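
The procedure described above can be sketched in Python. This is a minimal illustration, assuming that influences and activation probabilities are estimated by Monte Carlo simulation of the Linear Threshold process (the analytical expressions could be substituted). The names `simulate_lt`, `influence`, `restrict`, `g1_sieving`, the parameters `runs` and `eps`, and the representation `w[i][j]` (influence of i on j) are all hypothetical. Influence here counts only newly activated nodes, an assumption made so that values near zero can occur for leechers.

```python
import random


def simulate_lt(w, seeds, rng):
    """One Linear Threshold run; thresholds are drawn lazily, uniform on [0, 1)."""
    active = set(seeds)
    theta, pressure = {}, {}
    frontier = list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v, wt in w.get(u, {}).items():
                if v in active:
                    continue
                if v not in theta:
                    theta[v] = rng.random()
                    pressure[v] = 0.0
                pressure[v] += wt
                if pressure[v] >= theta[v]:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return active


def influence(w, seeds, runs, rng):
    """Return (mean number of newly activated nodes, activation frequencies)."""
    counts, total = {}, 0
    for _ in range(runs):
        active = simulate_lt(w, seeds, rng)
        total += len(active) - len(seeds)  # assumption: seeds are not counted
        for v in active:
            counts[v] = counts.get(v, 0) + 1
    return total / runs, {v: c / runs for v, c in counts.items()}


def restrict(w, removed):
    """Influence graph with the nodes in `removed` deleted."""
    removed = set(removed)
    return {u: {v: wt for v, wt in nbrs.items() if v not in removed}
            for u, nbrs in w.items() if u not in removed}


def g1_sieving(w, k, alpha, eps, runs=200, seed=0):
    rng = random.Random(seed)
    nodes = sorted(w)  # every node is assumed to appear as a key of w
    sigma = {i: influence(w, [i], runs, rng)[0] for i in nodes}
    g1 = sorted(nodes, key=lambda i: -sigma[i])  # the G1 list
    chosen = [g1.pop(0)]
    while len(chosen) < k and g1:
        _, act = influence(w, chosen, runs, rng)
        g1 = [i for i in g1 if act.get(i, 0.0) <= alpha]  # thresholding
        sub = restrict(w, chosen)
        score = {i: influence(sub, [i], runs, rng)[0] for i in g1}
        g1 = [i for i in g1 if score[i] >= eps]  # drop leechers
        if not g1:
            break
        best = max(g1, key=score.get)
        chosen.append(best)
        g1.remove(best)
    return chosen


# Toy graph with weights 0 or 1, so every run is deterministic.
w = {"a": {"b": 1.0}, "b": {"c": 1.0}, "c": {}, "d": {"e": 1.0}, "e": {}, "f": {}}
picked = g1_sieving(w, k=2, alpha=0.5, eps=0.5)
```

In this toy graph the top two entries of the G1 list are a and b, but b is always activated when a is the initial set, so the sieve removes it and selects d instead.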



8. SIMULATIONS 

8.1 Coauthorship Networks 

Newman Q] observed that scientific collaboration networks
are excellent examples of social networks. In such networks,
each node represents an author in the scientific community
under consideration, and an edge exists between two nodes
i and j if those two authors are listed as co-authors of at least
one paper. Newman in [18] explains a method
by which the (symmetric) strength of collaboration between
two authors can be extracted. We use this data to obtain
the Linear Threshold model parameters. The process is as
follows.

Let $\mathcal{R}$ denote the set of all papers under consideration in the
scientific community, excluding the papers that have only a
single author. For each $r \in \mathcal{R}$, let $n_r$ represent the number
of coauthors of paper r. Let $\mathcal{N}$ represent the union of all
authors of the papers in $\mathcal{R}$. Define S(i,r) to be 1 if author
i was a co-author of paper r and zero otherwise. Then $\omega_{i,j}$,
representing the strength of collaboration between authors
i and j, for $i \neq j$, is given by

$$\omega_{i,j} = \sum_{r \in \mathcal{R}} \frac{S(i,r)\, S(j,r)}{n_r - 1}$$

We do not define terms of the form $\omega_{i,i}$ since they do not
represent any measure of collaboration. Now, by using these
$\omega_{i,j}$'s, we obtain the entries of the influence matrix W by
normalizing, i.e.,

$$w_{i,j} = \frac{\omega_{i,j}}{\sum_{k} \omega_{k,j}}$$

The simulations in the following sections are carried out on a
coauthorship network of the NetScience community containing
1589 nodes.
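
The construction of the model parameters from a list of papers can be sketched as below. This is an illustration under stated assumptions: each paper is given as a list of author names, and the normalization makes the incoming weights of every node sum to 1, which is what using $P = W^T$ as a transition probability matrix would require. The function names and the toy paper list are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def collaboration_strengths(papers):
    """Symmetric strengths: each paper with n_r >= 2 authors adds 1/(n_r - 1)
    to every ordered pair of distinct coauthors; single-author papers are skipped."""
    omega = defaultdict(float)
    for authors in papers:
        authors = sorted(set(authors))
        n_r = len(authors)
        if n_r < 2:
            continue
        for i, j in permutations(authors, 2):
            omega[(i, j)] += 1.0 / (n_r - 1)
    return dict(omega)


def normalise_incoming(omega):
    """Assumed normalization: divide each strength by the total strength
    arriving at the influenced node j, so incoming weights sum to 1."""
    incoming = defaultdict(float)
    for (i, j), strength in omega.items():
        incoming[j] += strength
    return {(i, j): s / incoming[j] for (i, j), s in omega.items()}


# Hypothetical toy data: two joint papers and one single-author paper.
papers = [["a", "b"], ["a", "b", "c"], ["d"]]
omega = collaboration_strengths(papers)
w = normalise_incoming(omega)
```

Here authors a and b share both papers, so their strength is 1 + 1/2 = 1.5, while each of them has strength 1/2 with c; author d has no coauthors and does not appear in the network.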

8.2 Comparing PageRank with the Greedy Algorithm

In the preceding sections we examined various cases where the
PageRank algorithm was either optimal or performed very well. Here
we report simulations comparing the PageRank and greedy
algorithms on the NetScience dataset. They are also compared
against heuristics such as the out-degree (obtained by counting
the number of outgoing edges) and the weighted out-degree
(summing up the weights on the outgoing edges). In
the scenario where the $\omega_{i,j}$'s are obtained as above, it turned
out that the PageRank algorithm's performance was much
below that of the greedy algorithm. We ran another set of
simulations where W was interpreted as directed, for experimental
purposes. This means that the "strength of collaboration"
$\omega_{i,j}$ between two authors i and j was completely
assigned to the author with the higher index, i.e., for i < j,
$w_{j,i} = \omega_{i,j}$ and $w_{i,j} = 0$. This results in an influence
graph which is directed and has no cycles. In this case, the
PageRank algorithm performed on par with the greedy
algorithm. The results are shown in Figures 3 and 4.
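
The directed variant can be sketched as follows, assuming integer author indices and a dictionary of symmetric strengths keyed by ordered pairs. The text leaves any renormalization of the directed weights implicit, so none is applied here; the function name `orient_by_index` and the toy data are hypothetical.

```python
def orient_by_index(omega):
    """Assign each symmetric strength to the higher-index author.

    omega maps (i, j) to a symmetric strength. For i < j the result has
    w[(j, i)] = omega[(i, j)] and no entry for (i, j), i.e. w_{i,j} = 0.
    """
    w = {}
    for (i, j), strength in omega.items():
        if i < j:
            w[(j, i)] = strength  # j influences i; no renormalization applied
    return w


# Hypothetical symmetric strengths on three authors.
omega = {(1, 2): 0.5, (2, 1): 0.5, (2, 3): 1.0, (3, 2): 1.0}
w = orient_by_index(omega)
```

Every edge of the resulting graph points from a higher index to a lower one, so no directed cycle can exist.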

This is because the PageRank algorithm essentially works
with random walks on the given graph and finds their
stationary probability. But as we have pointed
out in Section 2, maximizing the spread of influence involves
random walks that are "self-avoiding". Thus it turns out
that in the directed case, where there are no cycles, the PageRank
algorithm works well, whereas in the undirected case
it fails to perform on par with the greedy algorithm. Thus,
even though it can be calculated efficiently to give a rough



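Evaluate $\sigma^{(\mathcal{N}, i)}$ for all $i \in \mathcal{N}$;
Sort nodes in decreasing order of $\sigma^{(\mathcal{N}, i)}$ to get G1;
X = G1(1);
G1 = G1 \ X;
for c = 2 to K do
    if G1 = $\emptyset$ then
        break;
    end
    for $i \in$ G1 do
        if $g_i^{(\mathcal{N}, X)} > \alpha$ || $\sigma^{(\mathcal{N} \setminus X, i)} < \epsilon$ then
            G1 = G1 \ {i};
            continue;
        end
        Evaluate $\sigma^{(\mathcal{N} \setminus X, i)}$;
    end
    v = $\arg\max_{i \in G1} \sigma^{(\mathcal{N} \setminus X, i)}$;
    X = X $\cup$ v;
    G1 = G1 \ {v};
end

Algorithm 2: G1-Sieving Algorithm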



Figure 3: Comparison of PageRank and Greedy -
undirected case

Figure 4: Comparison of PageRank and Greedy -
directed case



estimate of the network effect of a node, in many cases it 
might perform poorly compared to the greedy algorithm. 

8.3 Comparison of G1-Sieving with Greedy Algorithm

We carried out two experiments with the NetScience dataset
for the G1-Sieving algorithm. The first involved using only
the thresholding technique with various values of $\alpha$. It is
interesting to note the change in performance of the thresholding
technique as $\alpha$ varies, as shown in Figure 5.
For low values of $\alpha$, the algorithm retains only the nodes
whose influence domains are almost disjoint, and hence
performs badly. For high values of $\alpha$, the algorithm does
not remove many nodes, and the list almost resembles the
G1 list. It turned out that for this dataset, thresholding
with $\alpha = 0.3$ provided a very good approximation of the
greedy solution.

In the second case, we implement the G1-Sieving algorithm
with the threshold $\alpha = 0.3$ and with restriction. It is found
that restriction provides a slight improvement over the
solution with only thresholding. The results are shown in
Figure 6. We find that the G1-Sieving algorithm performs on par
with the greedy algorithm.

When $A \subset B \subset \mathcal{N}$ and set sizes are very small compared to
$|\mathcal{N}|$, due to the monotonicity of the influence function,
evaluating $\sigma^{(\mathcal{N}, A)}$ is faster than evaluating $\sigma^{(\mathcal{N}, B)}$, since the latter involves
more activations, whereas for set sizes closer to
$|\mathcal{N}|$, evaluating $\sigma^{(\mathcal{N}, B)}$ is faster than evaluating $\sigma^{(\mathcal{N}, A)}$, since the
former involves fewer nodes to be activated. Also, for a given



Figure 5: Comparison of various sieving thresholds
with Greedy algorithm

Figure 6: Comparison of G1-Sieving with Greedy
algorithm

set size, the time taken for evaluating the influence of a set 
A increases with the expected influence of the set. 

Keeping these in mind, we see that the G1-Sieving algorithm
runs much faster than the greedy algorithm, since it evaluates
influences only for initial sets consisting of a single node,
and as it proceeds it evaluates influences for fewer
influential nodes in a restricted graph. Also, the technique of
employing only thresholding performs almost on par with
the greedy algorithm, given that it involves only
evaluating $\sigma^{(\mathcal{N}, i)}$ for all $i \in \mathcal{N}$, i.e., the first stage of the greedy
algorithm, and then subsequently evaluating influences (to
get the activation probabilities $g_i^{(\mathcal{N}, X)}$) only once per round.

For the G1-Sieving algorithm, we obtained the optimal $\alpha$
using simulations. It would be interesting to look at ways
to estimate $\alpha$ based on the graph structure. One can also
use a variable $\alpha$, with a higher threshold for the initial nodes
and a lower one for later stages.

9. DISCUSSION 

In this paper we have derived an analytical expression for
the influence of a given set in a social network under the
Linear Threshold model. The insights thus obtained helped
us propose a better algorithm for choosing the initial set
to maximize the spread of influence. A similar approach
could be adopted for the independent cascade model. This
would help explain why certain heuristics work well and
also help in developing better algorithms. It should also be
noted that the current framework can be easily extended to
the time-constrained influence maximization problem, where
the activation process is terminated after a fixed number of
steps. Another interesting implication of this work is the role



played by self-avoiding random walks in the analytical
expression for the influence function. Finding an efficient way
to compute these probabilities will speed up the influence
computations. The PageRank algorithm was found to be
suboptimal since it works on the assumption of random
walks, which could involve cycles. As an interesting aside,
one can even model the walk of a random surfer on the Web
graph as a "self-avoiding" random walk, which may have
some implications for Web-page ranking algorithms.